As current students of the Information Systems program, and with diverse backgrounds, the main problem that a student faces is to decide in which direction should he/she forward his/her career. This is due to the reason that many of us have done an undergraduate in business, the software side, the electrical side, etc. Therefore this project will help them to anticipate what kind of work pays the most amount of money (P.S Money is very important :-P). This project will show the different kinds of salaries by college attended, program majored and even the region of living. In a nutshell, a student can decide their career plan based on our project.
Questions of Interest
What are the benefits of studying in the Ivy League as compared to other universities?
Which region has the best starting salary in the U.S at graduation and mid-career?
What are the salary figures of students from different universities ten years down the line?
What are the school rankings based on salaries now and even ten years after graduation?
What are the majors that offer the most salary increase in the coming years after graduation?
What are the significant factors that affect mid-career salary and how significant are they?
The current data we used is from Wall Street Journal, which is considered as a reliable source in the industry. There are couple constraint of the data that we want to inform readers of this project. First of all, the current dataset doesn't contain all the schools and majors in the U.S. Also, the data are collected several years ago so they may not be representative for the current situation. Therefore, all the conclusions we got in this project are based on the dataset.
In this project, we are not aimming to find the exact number of salary for each school, region or major but we would like to investigate the correlation between all these factors including starting salary, SAT score, mjors, school types and school regions. Moreover, we want to prove that we can forecast the mid-career salary with all these factors.
Data source: https://www.kaggle.com/wsj/college-salaries#salaries-by-college-type.csv
Package Installation. Major packages we used in this project is pandas and matplotlib.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
import sklearn
import seaborn as sns
import statsmodels.api as sm
import plotly.graph_objects as go
from geopy.geocoders import Nominatim
from bs4 import BeautifulSoup
import requests
geolocator = Nominatim(user_agent="specify_your_app_name_here")
import socket
s = socket.socket()
s.settimeout(1000)
We first read data in pandas dataframe and then cast string data from object to string and numeric data to float for further analysis and modification.
#User header to access the website
header = {
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36",
"X-Requested-With": "XMLHttpRequest"
}
poke_url = 'https://github.com/chshaw/SAT-ACT-Scores/blob/master/data/sat_2017.csv'
r = requests.get(poke_url, headers=header)
#Convert html into a dataframe
dfs = pd.read_html(r.text)
df = dfs[0].sort_values(by='State', ascending=True)
df = df.drop('Unnamed: 0',axis=1)
#Read state to get abrreviation for state
#state = pd.read_csv('states.csv')
#df['State'] = state['Abbreviation']
df.head()
college_type = pd.read_csv('salaries-by-college-type.csv')
college_type['School Type'] = college_type['School Type'].astype(str)
dummies = pd.get_dummies(college_type['School Type'])
college_type = pd.concat([college_type, dummies], axis = 1)
college_type = college_type.dropna()
college_type.head()
After reading in data, we need to do a further cleaning to make our data more readable and manipulatable.
cols = ['Starting Median Salary',
'Mid-Career Median Salary', 'Mid-Career 10th Percentile Salary',
'Mid-Career 25th Percentile Salary',
'Mid-Career 75th Percentile Salary',
'Mid-Career 90th Percentile Salary']
# Filter out unnecessary symbols for analysis
for col in cols:
college_type[col] = college_type[col].str.replace('$','')
college_type[col] = college_type[col].str.replace(',','')
#Cast string to float
for j in range(2,8):
for i in range(len(college_type)):
college_type.iloc[i,j] = float(college_type.iloc[i,j])
#Clean and categorize data
college_type_engineering = college_type[college_type['Engineering'] == 1].iloc[:,0:-5]
college_type_IL = college_type[college_type['Ivy League'] == 1].iloc[:,0:-5]
college_type_LA = college_type[college_type['Liberal Arts'] == 1].iloc[:,0:-5]
college_type_state = college_type[college_type['State'] == 1].iloc[:,0:-5]
college_type_party = college_type[college_type['Party'] == 1].iloc[:,0:-5]
college_type_engineering_mean = college_type_engineering.mean()
college_type_IL_mean = college_type_IL.mean()
college_type_LA_mean = college_type_LA.mean()
college_type_state_mean = college_type_state.mean()
college_type_party_mean = college_type_party.mean()
college_type_salary_mean = [college_type_engineering_mean, college_type_IL_mean, college_type_LA_mean, college_type_state_mean, college_type_party_mean]
college_type_salary = pd.DataFrame(college_type_salary_mean).T
college_type_salary.columns = list(college_type['School Type'].unique())
college_type_salary
#draw a barplot about the salary of different school type
college_type_salary.plot(kind = 'bar',figsize = (20,10))
plt.ylabel('Salary', fontsize = 25)
plt.xticks(rotation = 'vertical',fontsize = 17)
plt.yticks(fontsize = 20)
plt.title('Salary of different school type', fontsize = 25)
plt.legend(fontsize = 20)
plt.show()
From the bar polt, we are suprised to see that party schools have the highest salary in most periods of time. What's more, the line chart also tells us that party schools have a more rapid increasing speed of salary, which is contrary to what we think normally.
We continue to analyze salary distribution according to different regions in the U.S.
#prepare the data for radar plot
c_t = college_type_salary.T
c_t_eng = c_t.loc['Engineering']
c_t_party = c_t.loc['Party']
c_t_LA = c_t.loc['Liberal Arts']
c_t_IL = c_t.loc['Ivy League']
c_t_S = c_t.loc['State']
c_t_S
!!pip install pygal
#draw a barplot about the salary of different school types
import pygal
#call radar and set the parameters
radar_chart = pygal.Radar(fill = True, range=(10000,300000))
radar_chart.title = 'Salary By School Type'
#set label for each angle
radar_chart.x_labels = ['Starting','Mid-Career 10th','Mid-Career 25th','Mid-Career','Mid-Career 75th','Mid-Career 90th']
#draw the lines
radar_chart.add('Engineer', c_t_eng)
radar_chart.add('Party', c_t_party)
radar_chart.add('Liberal Arts', c_t_LA)
radar_chart.add('Ivy League', c_t_IL)
radar_chart.add('State', c_t_S)
radar_chart.render_to_file('radar_chart.svg')
# dsiplay the radar plot
from IPython.display import SVG, display
def show_svg():
display(SVG('radar_chart.svg'))
show_svg()
From the radar plot, we can see that
1.A funny one: The students in party school have the highest salaries in almost every stage and the students in Ivy league have the lowest ones.
2.The salary difference in party school is largest, and the difference in Ivy League is smallest.
We think the reason why we get that conclusion is the admission criteria and level of majors in party schools varies greatly, so the salary after they graduate varies greatly, and the high salary level students will raise the mean salary. On the contrary, the Ivy League school has close level students and majors, so their salaries are close and more reasonable.
college_region = pd.read_csv('salaries-by-region.csv')
college_region['Region'] = college_region['Region'].astype(str)
dummies = pd.get_dummies(college_region['Region'])
college_region = pd.concat([college_region, dummies], axis = 1)
#college_type.drop('Party', axis = 1, inplace = True)
college_region = college_region.dropna()
college_region.head()
for col in cols:
college_region[col] = college_region[col].str.replace('$','')
college_region[col] = college_region[col].str.replace(',','')
for j in range(2,8):
for i in range(len(college_region)):
college_region.iloc[i,j] = float(college_region.iloc[i,j])
college_type_engineering_mean = college_type_engineering.mean()
college_type_IL_mean = college_type_IL.mean()
college_type_LA_mean = college_type_LA.mean()
college_type_state_mean = college_type_state.mean()
college_type_party_mean = college_type_party.mean()
college_type_salary_mean = [college_type_engineering_mean, college_type_IL_mean, college_type_LA_mean, college_type_state_mean, college_type_party_mean]
college_type_salary = pd.DataFrame(college_type_salary_mean).T
college_type_salary.columns = list(college_type['School Type'].unique())
college_type_salary
college_region_CA = college_region[college_region['California'] == 1].iloc[:,0:-5]
college_region_W = college_region[college_region['Western'] == 1].iloc[:,0:-5]
college_region_MW = college_region[college_region['Midwestern'] == 1].iloc[:,0:-5]
college_region_S = college_region[college_region['Southern'] == 1].iloc[:,0:-5]
college_region_NE = college_region[college_region['Northeastern'] == 1].iloc[:,0:-5]
college_region_CA_mean = college_region_CA.mean()
college_region_W_mean = college_region_W.mean()
college_region_MW_mean = college_region_MW.mean()
college_region_S_mean = college_region_S.mean()
college_region_NE_mean = college_region_NE.mean()
college_region_salary_mean = [college_region_CA_mean, college_region_W_mean,college_region_MW_mean,college_region_S_mean,college_region_NE_mean]
college_region_salary = pd.DataFrame(college_region_salary_mean).T
college_region_salary.columns = list(college_region['Region'].unique())
college_region_salary
Because latitide and longitude of schools are needed in the following analysis, we decide to extract them at first. As a result, we can improve the analysis corresponsding to regions further.
# Merge data together
college_type2 = college_type.drop(['Engineering', 'Ivy League', 'Liberal Arts', 'Party', 'State'], axis = 1)
college_region2 = college_region.drop(['California', 'Midwestern','Northeastern', 'Southern', 'Western'], axis = 1)
college_merge = pd.merge(left = college_type2, right = college_region2, left_on = 'School Name', right_on = 'School Name')
college_merge = college_merge.dropna()
college_merge2 = college_merge.iloc[:, 0:9]
#There are some location can not be searched by geopy for some reason.
#The following modification are tested in google map
#Manually changed some addresses to other format
college_merge2['Name'] = college_merge2['School Name'].str.replace(r"\(.*\)","")
college_merge2['Name'] = college_merge2['Name'].str.replace("State University of New York ","SUNY")
college_merge2['Name'] = college_merge2['Name'].str.replace("University of Alaska, Anchorage","University of Alaska Anchorage")
college_merge2['Name'] = college_merge2['Name'].str.replace("University of Nevada, Las Vegas","4505 S Maryland Pkwy, Las Vegas, NV 89154")
college_merge2['Name'] = college_merge2['Name'].str.replace("University of Texas at El Paso","500 W University Ave, El Paso, TX 79968")
college_merge2['Name'] = college_merge2['Name'].str.replace("California State University, Dominguez Hills","California State University Dominguez Hills")
college_merge2['Name'] = college_merge2['Name'].str.replace("California State University, Sacramento","6000 J Street, Sacramento, CA 95819")
college_merge2['Name'] = college_merge2['Name'].str.replace("SUNY at Farmingdale","Farmingdale State College")
college_merge2['Name'] = college_merge2['Name'].str.replace("State University of New York ","SUNY")
college_merge2['Name'] = college_merge2['Name'].str.replace("Bowling Green State University","Bowling Green, OH 43403")
college_merge2['Name'] = college_merge2['Name'].str.replace("Georgia Institute of Technology","North Ave NW, Atlanta, GA 30332")
college_merge2['Name'] = college_merge2['Name'].str.replace("Polytechnic University of New York, Brooklyn","Brooklyn, NY 11201")
college_merge2['Name'] = college_merge2['Name'].str.replace("University of Maryland, College Park","University of Maryland-College Park")
college_merge2['Name'] = college_merge2['Name'].str.replace("SUNY at Potsdam","State University of New York at Potsdam")
college_merge2['Name'] = college_merge2['Name'].str.replace("SUNY at Oneonta","108 Ravine Pkwy, Oneonta, NY 13820")
college_merge2['Name'] = college_merge2['Name'].str.replace("SUNY at Geneseo","State University of New York at Geneseo")
college_merge2['Name'] = college_merge2['Name'].str.replace("California State University, Fullerton","800 N State College Blvd, Fullerton, CA 92831")
college_merge2['Name'] = college_merge2['Name'].str.replace("Texas A&M University","400 Bizzell St, College Station, TX 77843")
college_merge2['Name'] = college_merge2['Name'].str.replace("University of Alabama at Huntsville","301 Sparkman Dr NW, Huntsville, AL 35899")
college_merge2['Name'] = college_merge2['Name'].str.replace("University of California at Los Angeles","California, Los Angeles, 90095")
#Temporary lists to store information of latitude and longitude
latitude = []
longitude = []
states =[]
counter = 0
#This is an access to search location
geolocator = Nominatim(user_agent="specify_your_app_name_here")
#Iterate through each rows
for i, x in college_merge2.iterrows():
#Using try except to extend socket time and prevent timeout error
try:
#Using package geopy to get longitude and latitude
location = geolocator.geocode(x['Name'])
latitude.append(location.latitude)
longitude.append(location.longitude)
state = location.address.split(',')
states.append(state[-3])
except:
location = geolocator.geocode(x['Name'])
latitude.append(location.latitude)
longitude.append(location.longitude)
state = location.address.split(',')
states.append(state[-3])
college_merge2['Latitude'] = latitude
college_merge2['Longitude'] = longitude
college_merge2['State'] = states
#Take the difference of students in different locations into consideration
df.State = df.State.str.replace(' ','') # Standardize format of locations
college_merge2.State = college_merge2.State.str.replace(' ','')
college_merge_final = df.merge(college_merge2, left_on = 'State', right_on = 'State')
college_merge_final = college_merge_final.dropna()
# Manually pick the first 8 states among all levels
college_state = college_merge_final.iloc[:,[0,5,7,8,9,10,11,12]]
st = ['NewJersey', 'Massachusetts', 'Connecticut','Pennsylvania', 'Maryland','RhodeIsland','NewHampshire','SouthDakota','Colorado']
temp_state = college_state.groupby(by = 'State').mean()
college_state_salary = pd.DataFrame()
for i in st:
college_state_salary[i] = temp_state.loc[i,:].T
college_state_salary.index = college_state_salary.index.str.replace('_x','')
college_state_salary.plot(kind = 'barh', figsize = (25,20))
plt.title('Salary by States', fontsize = 30 )
plt.xlabel('Salary', fontsize = 30)
plt.yticks(fontsize = 25)
plt.xticks(fontsize = 25)
plt.legend(fontsize = 20)
Hence from the graphs above we can see that california and northeastern regions have the best starting and mid career salary in the U.S., which is close to our impression to those areas. To be more specific, the first 8 states that have more salary assemble in the northeast, which indicates a clearer conclusion for graduates.
After seperate analysis about regions and types of universities, we are curious about the comprehensive effect of the two factors, so we are going to combine them to make a deeper analysis.
Create a new variable 'Increasing rate' to reveal the potential career developing ability.
# Give priority to types of schools
college_merge2['Increasing rate'] = college_merge2['Mid-Career Median Salary_x']/college_merge2['Starting Median Salary_x']
college_merge3 = college_merge2.set_index(['School Type', 'State'])
college_merge3 = college_merge3.sort_values(by = 'Increasing rate', ascending = False)
college_merge3.head(10)
# Give priority to school regions
college_merge4 = college_merge2.set_index(['State','School Type', ])
college_merge4 = college_merge4.sort_values(by = 'Increasing rate', ascending = False)
college_merge4.head(10)
From the perspective of school type, although party schools have the amazingly highest mid-career salary, Ivy League and Liberal Arts schools have the highest salary increasing rate. From the other perspective, schools in Indiana, New Hampshine and Connecticut own the fasteast increasing rate, who are almost in the northeastern region.
As a result, for students who would like to get a fast growth in their career, they would better to choose Ivy League and Liberal Arts schools in Indiana, New Hampshine and Connecticut. On the other hand, the result also answer our question that Ivy League owns a better location which has a better academic atmosphere. This is the benefit to study there.
Since major-choosing is more subjective and isolated from regions and type of schools, we decide to analyze major seperately.
college_major = pd.read_csv('degrees-that-pay-back.csv')
#Since NaN doesn't have function replace, we convert it to 0 for now
for col in cols:
college_major[col] = college_major[col].str.replace('$','')
college_major[col] = college_major[col].str.replace(',','')
tem = ['Starting Median Salary','Mid-Career Median Salary','Mid-Career 10th Percentile Salary','Mid-Career 25th Percentile Salary','Mid-Career 75th Percentile Salary','Mid-Career 90th Percentile Salary']
college_major[tem] = college_major[tem].astype(float)
tem =['Starting Median Salary','Mid-Career Median Salary','Mid-Career 75th Percentile Salary']
college_major1 = college_major.sort_values(by=tem, ascending=False)[0:15]
college_major1 = college_major1.reset_index()
from matplotlib import rc
# y-axis in bold
rc('font', weight='bold')
# Values of each group
bars1 = college_major1['Mid-Career 90th Percentile Salary']
bars2 = college_major1['Mid-Career Median Salary']
bars3 = college_major1['Starting Median Salary']
# Heights of bars1 + bars2
bars = np.add(bars1, bars2).tolist()
# The position of the bars on the x-axis
r = [x for x in range(0,15)]
# Names of group and bar width
names = college_major1['Undergraduate Major']
barWidth = 1
# Create brown bars
plt.bar(r, bars1, color='#7f6d5f', edgecolor='white', width=barWidth, label='Mid-Career 90th Percentile Salary')
# Create green bars (middle), on top of the firs ones
plt.bar(r, bars2, bottom=bars1, color='#557f2d', edgecolor='white', width=barWidth, label='Mid-Career Median Salary')
# Create green bars (top)
plt.bar(r, bars3, bottom=bars, color='#2d7f5e', edgecolor='white', width=barWidth, tick_label='Starting Median Salary', label='Starting Median Salary')
# Custom X axis
plt.title('Salary for Different Stages of Top 15 Majors', fontsize = 15)
plt.xticks(r, names,rotation='vertical',fontsize = 15)
plt.xlabel(['High to low: Mid-Career 90th Percentile Salary','Mid-Career Median Salary','Starting Median Salary'])
plt.yticks(fontsize = 15)
plt.rcParams['figure.figsize'] = (9.0, 8.0)
# Show graphic
plt.legend(fontsize = 10)
According to the plot, we can clearly see the top 15 majors with the highest salary in the U.S. In average, students studying the chemical engineering earns the highest salary. Also, we are glad to see Information Systems is one of the top 15 majors with high salary.
Plot the result of rankings of different majors
pip install squarify
import squarify
college_major2 = college_major.sort_values('Percent change from Starting to Mid-Career Salary', ascending=False)[0:15]
height = college_major2['Undergraduate Major']
bars = college_major2['Percent change from Starting to Mid-Career Salary']
y_pos = np.arange(len(bars))
# Create horizontal bars
plt.barh(y_pos,bars)
# Create names on the y-axis
plt.title('Percent change from Starting to Mid-Career Salary of Top 15 Majors', fontsize = 20 )
plt.yticks(y_pos, height, fontsize = 15)
plt.ylabel("Major",fontsize = 15)
plt.xlabel('Precent Change Rate',fontsize = 15)
plt.rcParams['figure.figsize'] = (10.0, 10.0)
# Show graphic
plt.show()
From this graph we can see that Math, Philosophy, International Relations, Economics and Marketing are the top 5 undergraduate majors that offer the most salary increase down the line. And Majors like nursing and physician assistant are among the ones which do not see much salary hike in the coming years. Also, we are sad to notice that IS is not in the top list of increasing rate from starting to mid-career.
Hence from the above two graphs you can see that physician assistant has the hugest starting median salary in all branches but when talking about percent change from starting to mid career salary then it is the least. So even if is the most in the beginning one cannot expect much increment after that.
Among all the school types we can see that Ivy league students get higher salary on graduating and even after 10 years. And so there are benfits of studying in Ivy league as compared to the other category of colleges.
For guiding graduates more directly, we use mapbox to create a 'salary map' as a guide to show them which regions have a better developing chance.
# Cast Mid-career median salary into string temperarily to make it consistant with school name to be the label in our map
college_merge4['Mid-Career Median Salary_x_1'] = college_merge4['Mid-Career Median Salary_x'].astype(str)
mapbox_access_token = 'pk.eyJ1IjoiamNuMTk5NzI5IiwiYSI6ImNrMm9xNzdiOTE2OXYzY205OG52ZmxhZGQifQ.O2FQnsWlbbYwKPtU0rBTlQ'
college_map_data = go.Scattermapbox(
lon = college_merge4['Longitude'],
lat = college_merge4['Latitude'],
text = college_merge4['School Name'] + '<br>' + 'Mid-Career Median Salary: ' + college_merge4['Mid-Career Median Salary_x_1'],
hoverinfo='text',
mode = 'markers',
marker = dict(
color = college_merge4['Mid-Career Median Salary_x'],
symbol = 'circle',
opacity = .5
)
)
college_map_layout = go.Layout(
title = 'Mid-Career Salary of Universities In U.S',
mapbox=go.layout.Mapbox(
accesstoken=mapbox_access_token,
zoom=1
)
)
college_map = go.Figure(data=college_map_data, layout=college_map_layout)
college_map.show()
As the map shows, the points in lighter color represent a higher mid-career salary, and the darker points represent a lower level. The map confirm our conclusion that northeastern of America does have a higher salary level, and help graduates a clear guide when choose the working location to pursue a better development.
We are looking at the impacts of:
The salary of mid-career.
To test the null hypothesis, we will perform a Multiple Linear Regression on our dataset by using scikit-learn.
We will use our tidied datasets for the features and the target.
Features: Mid-career salary Target: dependent variables.
Since region and school type are categorical variables, we will create dummy variable for both variabls. To achieve that, we used pd.get_dummies() in this case.
After setting up the data, we will set up X and Y for our use in scikit-learn LinearRegression() function. Then, fit the model.
import seaborn as sns
import statsmodels.api as sm
from scipy import stats
from sklearn import linear_model
from sklearn import model_selection
from statsmodels import api as sm
Ordinary Least Squares Regression
tem = ['Region','School Type','Starting Median Salary_x', 'Math','Evidence-Based Reading and Writing']
df_tem = college_merge_final[tem]
X = pd.get_dummies(df_tem,drop_first=True)
X = sm.add_constant(X)
y = college_merge_final['Mid-Career Median Salary_x']
linear = linear_model.LinearRegression()
model = linear.fit(X,y)
linear.score(X,y)
R square is around 0.88, which means that 88% of the variation is explained by our model.
coef = linear.coef_.tolist()
tem = list(X.columns)
for f, c in zip(tem, coef):
print("Feature: {}, Coefficient: {}".format(f,c))
From the coefficients above, we can confirm our pervious estimation. We can see that, in average, studying in an Ivy-league school has the highest positive impact on your future salary. Also, students graduated from Liberal Arts tend to make more money than party and state school in average.
On the other hand, we are surprised to see that only northeastern has a postive coefficient. In average, students graduated from other regions tend to have negative impact on their career. We specilate that can be caused by most Lvy-league schools are located in Northeastern.
We want to observe more about this result by p-values, t-values, and std-err. Then, we need Ordinary Least Squares Regression model.
model = sm.OLS(endog=y, exog=X)
results = model.fit()
results.summary()
Given the assumption that the covariance matrix of the errors is correctly specified, we may say that the standard Errors are small enough for a good performance.
Looking at the 'coef' column above, we observe that obviously the School Type_Ivy League has significant impact on the mid-career salary. Also, the p-value of School Type_Ivy League is 0, perfectly below a critical value of 0.05, which confirms more about the findings.
In addition, we are more condifent that it is not true only student graduated from northeastern tend to have a positive impact in their career in average. Based on the observation of the p-values above ('P>|t|' column in the middle chart), we can see that the p-value of Region_Northeastern is way greater than a critical value of 5% (p-value of 0.05).
Also, the p-value of SAT verbal and math are all bigger than 5%. Therefore, we conclude that SAT score are not a significant factors for predicting mid-career salary. We should remove this two variables from the model.
All these mean that we should reject the null hypothesis because obviously the school type and region have significant impact on mid-career salary.
Here we will split up the dataset into training and testing data for both variables. The percentage of Test/train split depends on different programmer. For our data, we will use 20% of the data for testing purpose and 80% for training purposes. We will generate mutiple linear regression model using the training data first. Then, we will make the predictions for the testing data. After that, we will compare the predictions with the actual results (Y_test).
from sklearn.model_selection import train_test_split, KFold, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, export_graphviz
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
# Create train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=12345)
# create and fit the linear model
linear = linear_model.LinearRegression()
model = linear.fit(X_train, y_train)
# generate predictions for player Rating to compare with y_test data
predictions = linear.predict(X_test)
# display first 5 results of our predicted mid-career salary.
predictions[0:5]
plt.figure(figsize=(12,12))
plt.title("Predicted vs. Actual Values for Mid-career Salary", fontsize=16)
plt.scatter(y_test,predictions)
plt.plot(y_test, y_test, color="Red") # identity line y=x
plt.xlabel("Actual Values")
plt.ylabel("Predicted Values")
plt.show()
f, ax = plt.subplots(figsize=(13,10))
plt.title('Data Distribution for Actual and Predicted')
sns.distplot(y_test, hist=False, label="Actual", ax=ax)
sns.distplot(predictions, hist=False, label="Linear Regression Predictions", ax=ax)
plt.show()
print("Accuracy (scale of 0 to 1): {}".format(model.score(X_test, y_test)))
From two plots above, we obviously observe that our model predicted the mid-career salary very well for the testing dataset. Our model also provides a score of the actual accuracy (1 is perfect), and our accuracy is around 0.88, which is a consideratable high accuracy score for the model.
#draw a bar plot of the salary increase by school type
college_type_salary = college_type_salary.reindex(index = ['Starting Median Salary','Mid-Career 10th Percentile Salary','Mid-Career 25th Percentile Salary','Mid-Career Median Salary','Mid-Career 75th Percentile Salary','Mid-Career 90th Percentile Salary',])
xn = ['salary_increase']
college_type_rise = pd.DataFrame(columns = ['Engineer', 'Party', 'Liberal Arts', 'Ivy League','State'])
college_type_rise.loc[1] = [college_type_salary.iloc[3,i] -college_type_salary.iloc[0,i] for i in range(len(college_type_salary.columns))]
college_type_rise.plot(kind = 'bar')
plt.xticks([1],xn)
plt.ylabel('Salary', fontsize = 17)
plt.xlabel('Schoole types', fontsize = 17)
plt.title('salary increase by school type')
college_region_salary.plot(kind = 'bar', figsize = (20,10))
plt.ylabel('Salary', fontsize = 30)
plt.xticks(rotation = 'vertical', fontsize = 17)
plt.yticks(fontsize = 20)
plt.legend(fontsize = 20)
plt.show()